Group Names

1. SHAHZAD AMIN

2. HAFIZ MUHAMMAD HASSAN

Topic: Stock, Dependency and Graphs

To start working on the homework we have download the stocks data. For this we have first created the CSV file named as stocks.csv which simply contains the names of the comapnies for those we have to collect the data. To do this we have also focused on collecting the 10 companies data in each category discussed in homework description. We have also noticed that instead of 10 categories there are 11 categories. Also for this part we have focused on collecting the data from January 1, 2003 through January 1, 2008.

This makes us total of 110 companies categorized into 11 Global Industry Classification Standard (GICS).

Point 2: Select a sensible portfolio of stocks and take data from January 1, 2003 through January 1, 2008, before the onset of the “financial crisis”. Build the data matrix X

Lets assume we have already all the packages. I will include them here.

require(tseries, warn.conflicts=F, quietly = TRUE)
require(igraph, warn.conflicts=F, quietly = TRUE)
require(lemon,warn.conflicts=F, quietly = TRUE)
require(GGally,warn.conflicts=F, quietly = TRUE)

After that I will just read the csv file which will have rows. I will show rows that it contains. I also have to use the lemon library to pretty print the dataframe values.

stock <- read.csv("stocks.csv")
knitr::kable(head(stock), format="markdown")
Symbol Security GICSSector GICS.Sub.Industry Location Date.first.added CIK Founded
A Agilent Technologies Inc Health Care Health Care Equipment Santa Clara, California 6/5/2000 1090872 1999
UNP Union Pacific Industrials Railroads Omaha, Nebraska 100885
AAP Advance Auto Parts Consumer Discretionary Automotive Retail Roanoke, Virginia 7/9/2015 1158449 1932
AAPL Apple Inc. Information Technology Technology Hardware, Storage & Peripherals Cupertino, California 11/30/1982 320193 1977
ABC AmerisourceBergen Corp Health Care Health Care Distributors Chesterbrook, Pennsylvania 8/30/2001 1140859 1985
ABMD ABIOMED Inc Health Care Health Care Equipment Danvers, Massachusetts 5/31/2018 815094 1981

From above we can see that Symbol column contains the companies name and GICSSector contains the category. Now I will download the data from January 1, 2003 through January 1, 2008.

symbols <- stock[["Symbol"]]
data <- NULL
options("getSymbols.yahoo.warning"=FALSE)
options("getSymbols.warning4.0"=FALSE)
for (row in symbols) {
  if (is.null(data)) {
    data <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
  }
  else {
    temp <- suppressWarnings(get.hist.quote(instrument = row, start = "2003-01-01", end = "2008-01-01", quote = "Close", provider = "yahoo", drop = TRUE))
    data <- cbind(data, temp)
  }
}
cMatrix <- as.matrix(data)

After downloading the data in cMatrix we have to assign the names to columns and we have to remove the na values and infinite values. We are replacing them with 0 instead of removing the whole row. For getting required X matrix we also have to use diff and log formula on our matrix. I will be outputing its first few rows and columns using kable function.

cMatrix <- as.matrix(data)
colnames(cMatrix) <- symbols
cMatrix[is.na(cMatrix)] <- 0
logMatrix <- diff(log(cMatrix))
logMatrix[is.na(logMatrix)] <- 0
logMatrix[is.infinite(logMatrix)] <- 0
knitr::kable(head(logMatrix[,1:20],6), format="markdown")
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
2003-01-03 -0.0047133 -0.0065585 -0.0155700 0.0067342 0.0149835 0.0231076 0.0096908 0.0074946 0.0269766 0.0042381 0.0063694 -0.0056783 -0.0027816 -0.0284840 0.0136249 0.0200253 0.0628009 0.0003165 0.0193509 -0.0029895
2003-01-06 0.0466631 0.0179333 -0.0659116 0.0000000 0.0118948 0.0275362 0.0064087 0.0184946 0.0449806 0.0596970 0.0015860 0.0027197 -0.0168544 0.0284840 0.0418107 0.0590889 0.0000000 0.0125788 0.0129210 0.0330474
2003-01-07 -0.0090589 0.0001616 0.0056436 -0.0033619 -0.0260110 -0.0849932 -0.0460053 -0.0195618 0.0357053 0.0046974 -0.0143658 -0.0094270 0.0056497 0.0422456 -0.0350673 -0.0175298 -0.0234615 -0.0084733 -0.0033841 -0.0196643
2003-01-08 -0.0497512 -0.0130083 -0.0062969 -0.0204083 -0.0372670 -0.0300159 0.0248996 -0.0069649 -0.0496148 -0.0266682 -0.0072610 -0.0085107 -0.0028208 -0.0105612 0.0018522 -0.0060240 -0.0059524 0.0022037 -0.0164050 -0.0125517
2003-01-09 0.0390776 0.0105821 0.0140589 0.0088943 -0.0128590 0.2314195 0.0032570 -0.0124426 0.0400993 0.0484199 0.0128723 0.0159604 0.0168071 0.0138388 -0.0224571 0.0070246 0.0409414 0.0125002 0.0027529 0.0363914
2003-01-10 0.0167220 -0.0035691 -0.0032269 0.0027219 -0.0113658 -0.0968498 -0.0050151 -0.0822226 0.0144140 0.0205977 -0.0088319 -0.0106952 0.0456105 -0.0245127 -0.0066461 -0.0537525 -0.0379608 -0.0125002 -0.0093217 -0.0095002

After that I will plot logMatrix which will be required X matrix.

plot(logMatrix)

Above completes point 2 part of the homework in Your Job section.

Lets move to point 3.

Point 3: With this data, consider the usual Pearson correlation coefficient between stocks, and implement the bootstrap procedure described at page 3 of our notes to build marginal correlation graphs. In particular, visualize the dynamic of the graph as ε varies, highlighting the gics sectors with annotation/node color. Draw some conclusion: is there any statistical evidence to support the claim that stocks from the same sector cluster together? Explain.

For this we have to calculate Pearson correlation coefficient between stocks. For this we have to simply use cor function.

pearson_correlation_matrix <- cor(logMatrix, method = "pearson", use = "everything")
R <- pearson_correlation_matrix
R[is.na(R)] <- 0
knitr::kable(head(R[,1:20],6), format="markdown") # outputing the rows
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI ADM ADP ADS ADSK AEE AEP AES AFL AGN AIG
A 1.0000000 0.3132293 0.2453006 0.2894047 0.2060193 0.1929294 0.2173697 0.2595466 0.3440663 0.4530926 0.1865611 0.3108620 0.2168602 0.2993274 0.2265353 0.2792155 0.2083448 0.2708431 0.2246741 0.3329415
UNP 0.3132293 1.0000000 0.2409231 0.2394450 0.2275416 0.1805622 0.2706303 0.2816294 0.2722011 0.2881027 0.2555853 0.3005328 0.2097893 0.2803825 0.2796656 0.2944920 0.2480392 0.2920236 0.1999380 0.3494798
AAP 0.2453006 0.2409231 1.0000000 0.1999078 0.1331749 0.1000599 0.1607918 0.1810424 0.1746663 0.2027267 0.1106188 0.1685443 0.1931692 0.1736631 0.1184338 0.1588764 0.1793692 0.2229857 0.1487453 0.2467757
AAPL 0.2894047 0.2394450 0.1999078 1.0000000 0.1270099 0.0927325 0.1895600 0.1360781 0.3016231 0.2960004 0.1349346 0.2564773 0.2002367 0.2784136 0.1937273 0.2194870 0.1332074 0.2081890 0.2015270 0.2546643
ABC 0.2060193 0.2275416 0.1331749 0.1270099 1.0000000 0.1144457 0.2634622 0.1995059 0.1942354 0.1777007 0.1762504 0.2228011 0.1627588 0.1479000 0.2147295 0.2136665 0.1685642 0.1420544 0.2784182 0.2002196
ABMD 0.1929294 0.1805622 0.1000599 0.0927325 0.1144457 1.0000000 0.1759887 0.1675586 0.1482111 0.1929230 0.0936166 0.1393736 0.1013807 0.1779248 0.1643561 0.2037472 0.1463400 0.1504244 0.1378275 0.1819907

From above we can see that columns and rows that have same value have higher dependency on each other usaully 1 as you can see the first row and first column value of “A” company.

Now lets move to calculating Marginal Correlation Graphs.. But first we have to calculating the bootstrap confidence interval for this we have used our own algorithm instead of using Boot function.

#Calculating bootstarp confidence interval
B = 1000 # considering only 1000 iteration for sampling purposes
brep = rep(NA, B)
for (b in 1:B){
  row_idx <- sample(1:nrow(logMatrix), replace = T)
  bSample <- logMatrix[row_idx,] # bootstrap sample
  rownames(bSample) <- rownames(logMatrix)
  sample_Cor <- cor(bSample, method = "pearson", use = "everything") # getting correlation from original matrix for sample
  sample_Cor[is.na(sample_Cor)] <- 0 # replacing values 
  btheta  <- sqrt(nrow(logMatrix))*max(sample_Cor-R)   # bootstrap replicate
  brep[b] <- btheta            # save
}

After above we have our bootstrapped replicated values. Till now we have ∆b After we will calculate the our Emperical CDF.

Gstar<- ecdf(brep)
plot(Gstar)

From above you can see that values graph changes between 0 to 1. Now I have Fn(t) hat. I am going to move t alpha based on 0.95 confidence.

confidence <- as.numeric(quantile(brep,0.95)/sqrt(nrow(logMatrix)))
plot(confidence)

From above you can notice that its value is 0.18. Lets move to building P􏰂R[j,k]∈Cj,k(α) for all (j,k)􏰃 →n 1−α for this I need build new metrix based on this condition “If we have a confidence interval Cn,α then we can put an edge whenever [−ε, +ε] ∩ Cn,α = ∅.”

I am going to a function which will create matrix for edges if based on our confidence interval and error as ee = 0.1

matrix_apply <- function(m,ee,con) {
  m2 <- m
  for (r in seq(nrow(m2))){
    for (c in seq(ncol(m2))){
      l <- m[[r,c]]-con
      h <- m[[r,c]]+con
      eel <- ee*-1
      eeh <- ee
      if(h <= eel || l >= eeh){
        m2[[r,c]]<-1
      }
      else{
        m2[[r,c]]<-0
      }
    }
  }
  return(m2)
}

ee<-0.13
m<-matrix_apply(R,ee,confidence)
knitr::kable(head(m[,1:10],6), format="markdown") 
A UNP AAP AAPL ABC ABMD ABT ACN ADBE ADI
A 1 1 0 0 0 0 0 0 1 1
UNP 1 1 0 0 0 0 0 0 0 0
AAP 0 0 1 0 0 0 0 0 0 0
AAPL 0 0 0 1 0 0 0 0 0 0
ABC 0 0 0 0 1 0 0 0 0 0
ABMD 0 0 0 0 0 1 0 0 0 0

From above we have 1 in the matrix when edge is possible and 0 otherwise. Moving to creating a graph from adjacency matrix as m and simplify it if there are some stocks which points to itself. But before doing I need a function which will color the stocks which has same GICS .

color_graph <- function(graph,stocks){
  sectors <- unique(stocks$GICSSector) # getting GICS 
  colors <- rainbow(length(sectors)) # getting colors based on GICS 

  for(i in 1:nrow(stocks)) {
    row <- stocks[i,]
    
    #assigning the colors
    for(j in 1:length(sectors)) {
      if(sectors[j] == row$GICSSector){
        V(graph)[i]$color <- colors[j] # assigning color to each stock but assign same color which are in same sector
        break
      }
    }
  }
  return (graph)
}

Lets plot the graph after using above function for coloring.

ee<- 0.13
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock))

From above you can see that with error 0.13 we get the perfect result of what we have wanted to support our claim that “stocks from the same sector cluster together”. To support this claim I have drawn the graph containing unique color for stocks that are in the same GICS

We will going to run this with different error values to see what will happen to our stocks when we increase error value ee

ee<- 0.02
for( i in 1:20){
ee <-ee + 0.02
m<- matrix_apply(R,ee,confidence)
g<-simplify(graph_from_adjacency_matrix(m,diag = FALSE,mode = "undirected"))
plot(color_graph(g,stock), main=c("Error value as ",ee))
}

From above you can see that our confidence value wa around 0.18 initailly and as we increase our error value from 0.02 to 0.42 stocks from same GICS started spreading instead of merging on same clustor. But still you can see that stocks from same GICS always clustered togethor.